Reverse KL
Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target-task performance. To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables it to keep prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
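The mixture-of-two-distributions argument can be made concrete with a small numeric sketch (all distributions and parameters below are illustrative assumptions, not the paper's actual setup): fitting a single Gaussian to a bimodal target by minimizing forward KL, the mass-covering direction, lands between the modes, while minimizing reverse KL, the mode-seeking direction associated with on-policy training, locks onto one mode and leaves the other component's region untouched.

```python
import numpy as np

# Discretized bimodal target: one mode for "prior knowledge", one for the
# "target task" (an illustrative toy, not the paper's experimental setup).
x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def gauss(mu, sigma=1.0):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / (g.sum() * dx)   # normalize on the grid

p = 0.5 * gauss(-3.0) + 0.5 * gauss(3.0)   # bimodal mixture

def kl(a, b):
    return float(np.sum(a * np.log(a / b)) * dx)

# Fit a unit-variance Gaussian q_mu by scanning its mean under each objective.
mus = np.linspace(-5, 5, 401)
fwd = [kl(p, gauss(m)) for m in mus]   # forward KL(p || q): mass-covering
rev = [kl(gauss(m), p) for m in mus]   # reverse KL(q || p): mode-seeking

mu_fwd = float(mus[int(np.argmin(fwd))])
mu_rev = float(mus[int(np.argmin(rev))])
print(f"forward-KL optimum: mu = {mu_fwd:+.2f}")  # lands between the modes
print(f"reverse-KL optimum: mu = {mu_rev:+.2f}")  # lands on one mode
```

The forward-KL fit averages over both modes (destroying both), whereas the reverse-KL fit commits to a single mode, mirroring the claim that mode-seeking training can learn the target task while leaving the prior-knowledge mode intact.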
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
- (2 more...)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (2 more...)
Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL
Wei Yao, Wenkai Yang, Ziqiao Wang, Yankai Lin, Yong Liu
As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence, whose mass-covering behavior risks overfitting to imperfect weak signals, with reverse KL divergence. Reverse KL divergence's zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence, establishing that reverse KL achieves at least comparable guarantees to forward KL. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last layer, reverse KL uniquely guarantees that it outperforms its weak supervisor by the magnitude of their disagreement, a guarantee that forward KL cannot provide. Empirically, we demonstrate that reverse KL and reverse cross-entropy enable strong models to consistently outperform those trained with forward KL and standard cross-entropy across most settings, highlighting the practical advantages of these reverse losses.
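A minimal sketch of the forward vs. reverse cross-entropy contrast the abstract describes (the label vectors below are made-up toy values, not from the paper): against noisy weak labels, forward cross-entropy favors a strong model that imitates the noise, while reverse cross-entropy, through its zero-forcing direction, favors the confident prediction.

```python
import numpy as np

def forward_ce(weak, strong):
    # standard cross-entropy against the weak labels: -sum weak * log(strong)
    return float(-np.sum(weak * np.log(strong)))

def reverse_ce(weak, strong):
    # reverse cross-entropy: -sum strong * log(weak), the zero-forcing direction
    return float(-np.sum(strong * np.log(weak)))

weak_noisy = np.array([0.55, 0.45])   # unreliable, low-confidence weak labels
copy_weak = weak_noisy                # strong model that imitates the weak signal
confident = np.array([0.95, 0.05])    # strong model that commits to one class

# Forward CE is minimized by imitating the noisy weak labels;
# reverse CE instead prefers the high-confidence prediction.
print("forward CE:", forward_ce(weak_noisy, copy_weak), "vs", forward_ce(weak_noisy, confident))
print("reverse CE:", reverse_ce(weak_noisy, confident), "vs", reverse_ce(weak_noisy, copy_weak))
```

Swapping which distribution sits inside the logarithm is the whole difference: under forward CE the strong model is pulled toward the 55/45 noise, while under reverse CE the confident prediction scores strictly better than copying the weak supervisor.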
- Asia > Afghanistan > Parwan Province > Charikar (0.05)
- Asia > Middle East > Jordan (0.04)
- Asia > China (0.04)
Flow-based sampling for multimodal and extended-mode distributions in lattice field theory
Daniel C. Hackett, Chung-Chun Hsieh, Sahil Pontula, Michael S. Albergo, Denis Boyda, Jiunn-Wei Chen, Kai-Feng Chen, Kyle Cranmer, Gurtej Kanwar, Phiala E. Shanahan
Recent results have demonstrated that samplers constructed with flow-based generative models are a promising new approach for configuration generation in lattice field theory. In this paper, we present a set of training- and architecture-based methods to construct flow models for targets with multiple separated modes (i.e., vacua) as well as targets with extended/continuous modes. We demonstrate the application of these methods to modeling two-dimensional real and complex scalar field theories in their symmetry-broken phases. In this context we investigate different flow-based sampling algorithms, including a composite sampling algorithm where flow-based proposals are occasionally augmented by applying updates using traditional algorithms like HMC.
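A toy sketch of the flow-based sampling step described here, with a fixed broad Gaussian standing in for a trained flow (the target, proposal, and all parameters are illustrative assumptions): an independence Metropolis sampler whose accept/reject test corrects for the proposal density, applied to a 1D target with two separated modes playing the role of vacua.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    # unnormalized bimodal target: equal-weight Gaussians at the two "vacua"
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

prop_sigma = 3.0   # broad proposal covering both modes (stand-in for a flow)
def log_q(x):
    return -0.5 * (x / prop_sigma) ** 2

x, chain, accepted = 0.0, [], 0
for _ in range(20000):
    x_new = prop_sigma * rng.standard_normal()
    # independence-Metropolis ratio, correcting for the proposal density
    log_alpha = (log_p(x_new) - log_p(x)) + (log_q(x) - log_q(x_new))
    if np.log(rng.random()) < log_alpha:
        x, accepted = x_new, accepted + 1
    chain.append(x)

chain = np.array(chain)
acc_rate = accepted / len(chain)
print(f"acceptance rate: {acc_rate:.2f}")
print(f"fraction of samples in the x > 0 mode: {(chain > 0).mean():.2f}")  # ~0.5
```

Because the proposal is drawn independently of the current state, the chain can jump directly between the separated modes, which is exactly the property that makes flow proposals attractive next to local updaters like HMC in a composite sampler.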
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- (6 more...)
- Energy (0.67)
- Government > Regional Government (0.45)
Reviews: Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness
Detecting inputs that are outside the distribution of training examples, including adversarial inputs, is an important problem; reviewers and the area chair agree that this paper makes a useful algorithmic contribution toward solving it. The argument that reverse KL is conceptually correct, while forward KL as used previously is conceptually wrong, is significant. Training with reverse KL is a simple and compelling idea that practitioners can try easily. For these reasons the paper is being accepted so that the community can benefit from it quickly, even though reviewers have identified ways in which both the writing and the empirical evaluation need improvement. The authors are encouraged to improve the final version.
Mixed Noise and Posterior Estimation with Conditional DeepGEM
Paul Hagemann, Johannes Hertrich, Maren Casfor, Sebastian Heidenreich, Gabriele Steidl
In numerous healthcare and other contemporary applications, the variables of primary interest are obtained through indirect measurements, such as in the case of Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). For some of these applications, the reliability of the results is of particular importance. The accuracy and trustworthiness of the outcomes obtained through indirect measurements are significantly influenced by two critical factors: the degree of uncertainty associated with the measuring instrument and the appropriateness of the (forward) model used for the reconstruction of the parameters of interest (measurand). In this paper, we consider Bayesian inversion to obtain the measurand from signals measured by the instrument and a noise model that mimics both the instrument noise and the error of the forward model.
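For intuition, the Bayesian inversion setting can be sketched in its simplest linear-Gaussian form (the forward model, covariances, and measurement below are made-up values, not the paper's DeepGEM method): the posterior over the measurand combines the prior with a likelihood whose noise covariance absorbs both instrument noise and forward-model error.

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.0, 1.0]])                 # assumed linear forward model
Sigma_noise = 0.1 * np.eye(2)              # instrument noise + model-error covariance
Sigma_prior = np.eye(2)                    # prior covariance of the measurand
mu_prior = np.zeros(2)
y = np.array([1.2, 0.4])                   # measured signal

# Gaussian posterior: precision-weighted combination of prior and likelihood.
prec_post = np.linalg.inv(Sigma_prior) + A.T @ np.linalg.inv(Sigma_noise) @ A
P_post = np.linalg.inv(prec_post)          # posterior covariance
mu_post = P_post @ (np.linalg.inv(Sigma_prior) @ mu_prior
                    + A.T @ np.linalg.inv(Sigma_noise) @ y)

# With low noise, the posterior mean sits near the noiseless inversion
# A^-1 y = [1.0, 0.4], pulled slightly toward the prior mean.
print("posterior mean:", mu_post)
print("posterior covariance:\n", P_post)
```

Inflating `Sigma_noise` to reflect forward-model error (rather than instrument noise alone) widens the posterior accordingly, which is the kind of honest uncertainty accounting the abstract emphasizes.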
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Germany > Berlin (0.04)
Sample as You Infer: Predictive Coding With Langevin Dynamics
Umais Zahid, Qinghai Guo, Zafeirios Fountas
It is well known that neuronal systems, including their dynamics and responses, are rife with noise at multiple levels (Faisal et al., 2008; Shadlen & Newsome, 1998). These sources of noise arise from, amongst other things, stochastic processes occurring at the sub-cellular level, impacting neuronal responses through, for example, fluctuations in membrane potential (Derksen & Verveen, 1966). Yet the precise role of such randomness in information processing continues to be an open question (McDonnell & Ward, 2011; Deco et al., 2013). The Langevin PC algorithm suggests one such role may be the principled exploration of the latent space of hypotheses under one's generative model. Secondly, from the perspective of Langevin PC as an in-silico generative modelling algorithm, we note a number of interesting avenues that we have not had the time to explore here. These include:
- Models with a hierarchy of stochastic variables, such as those found in most state-of-the-art VAE models (Child, 2021; Vahdat & Kautz, 2021; Hazami et al., 2022), which may require adopting a corresponding top-down hierarchical warm-start model.
- Automatic convergence criteria for determining when our Markov chain has converged to a certain level of error (Roy, 2020).
- Underdamped Langevin dynamics, which incorporate auxiliary momentum variables into the Langevin sampling to achieve an accelerated rate of convergence (Cheng et al., 2018; Ma et al., 2019).
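A minimal sketch of Langevin sampling over a latent variable, in the spirit of Langevin PC (the one-dimensional Gaussian model and step size are illustrative assumptions, not the paper's architecture): unadjusted Langevin dynamics follow the log-posterior gradient while injected noise plays exactly the exploratory role discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy generative model: x = z + noise, with a standard Gaussian prior on z.
x_obs, sigma_lik, sigma_prior = 2.0, 1.0, 1.0

def grad_log_post(z):
    # d/dz [ log N(x_obs; z, sigma_lik^2) + log N(z; 0, sigma_prior^2) ]
    return (x_obs - z) / sigma_lik**2 - z / sigma_prior**2

eps, z, samples = 0.05, 0.0, []
for t in range(50000):
    # unadjusted Langevin step: gradient drift plus injected Gaussian noise
    z = z + 0.5 * eps * grad_log_post(z) + np.sqrt(eps) * rng.standard_normal()
    if t > 5000:                      # discard burn-in
        samples.append(z)

# The exact posterior here is N(1.0, 0.5); the chain should settle close to it.
print(f"sample mean: {np.mean(samples):.2f}, sample variance: {np.var(samples):.2f}")
```

Dropping the noise term recovers a plain gradient ascent to the posterior mode, i.e. standard predictive-coding inference; the added noise is what turns mode-finding into sampling over the latent space of hypotheses.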
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- North America > United States > New York (0.04)
- (2 more...)
- Instructional Material (0.46)
- Research Report (0.43)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)